library(data.table)
library(dplyr)
library(ggplot2)

Data Cleaning and Validation

X2016 <- read.csv("world-happiness-2/2016.csv")
data = as.data.table(X2016)
names = c("Country", "Region", "HappinessRank", "Score", "LowerCI", "UpperCI", "GDP", "Family", "LifeExp", "Freedom", "Corruption", "Generosity", "DystopianResidual")
names(data) = names

First, we need to import the raw CSV file, read it in as a data frame, and convert it to a data table in order to use R packages such as dplyr. Next, we rename the columns with appropriate headers and validated the classes of each column, which were the variables GDP, freedom, life expectancy, family, corruption, generosity, and dystopian residual. In order to validate the data, we had to make sure variables such as country and region were factors with 157 and 10 levels, respectively, representing the 157 countries and 10 regions of the world. Happiness rank is an integer representing the rank of the country’s happiness score, given as a num. All other variables, excluding country, region, and happiness rank, are supposed to be numeric decimal values. The descriptions of the columns are given as:

Data Analysis

summary(data)
##         Country                                Region   HappinessRank   
##  Afghanistan:  1   Sub-Saharan Africa             :38   Min.   :  1.00  
##  Albania    :  1   Central and Eastern Europe     :29   1st Qu.: 40.00  
##  Algeria    :  1   Latin America and Caribbean    :24   Median : 79.00  
##  Angola     :  1   Western Europe                 :21   Mean   : 78.98  
##  Argentina  :  1   Middle East and Northern Africa:19   3rd Qu.:118.00  
##  Armenia    :  1   Southeastern Asia              : 9   Max.   :157.00  
##  (Other)    :151   (Other)                        :17                   
##      Score          LowerCI         UpperCI           GDP        
##  Min.   :2.905   Min.   :2.732   Min.   :3.078   Min.   :0.0000  
##  1st Qu.:4.404   1st Qu.:4.327   1st Qu.:4.465   1st Qu.:0.6702  
##  Median :5.314   Median :5.237   Median :5.419   Median :1.0278  
##  Mean   :5.382   Mean   :5.282   Mean   :5.482   Mean   :0.9539  
##  3rd Qu.:6.269   3rd Qu.:6.154   3rd Qu.:6.434   3rd Qu.:1.2796  
##  Max.   :7.526   Max.   :7.460   Max.   :7.669   Max.   :1.8243  
##                                                                  
##      Family          LifeExp          Freedom         Corruption     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.6418   1st Qu.:0.3829   1st Qu.:0.2575   1st Qu.:0.06126  
##  Median :0.8414   Median :0.5966   Median :0.3975   Median :0.10547  
##  Mean   :0.7936   Mean   :0.5576   Mean   :0.3710   Mean   :0.13762  
##  3rd Qu.:1.0215   3rd Qu.:0.7299   3rd Qu.:0.4845   3rd Qu.:0.17554  
##  Max.   :1.1833   Max.   :0.9528   Max.   :0.6085   Max.   :0.50521  
##                                                                      
##    Generosity     DystopianResidual
##  Min.   :0.0000   Min.   :0.8179   
##  1st Qu.:0.1546   1st Qu.:2.0317   
##  Median :0.2225   Median :2.2907   
##  Mean   :0.2426   Mean   :2.3258   
##  3rd Qu.:0.3119   3rd Qu.:2.6646   
##  Max.   :0.8197   Max.   :3.8377   
## 
plot(data)

datasubset = data[, c(4, 7,8,9,10,11,12)]
cor(datasubset)
##                Score         GDP     Family    LifeExp   Freedom
## Score      1.0000000  0.79032202 0.73925158 0.76538433 0.5668267
## GDP        0.7903220  1.00000000 0.66953969 0.83706723 0.3622828
## Family     0.7392516  0.66953969 1.00000000 0.58837678 0.4502082
## LifeExp    0.7653843  0.83706723 0.58837678 1.00000000 0.3411993
## Freedom    0.5668267  0.36228285 0.45020820 0.34119929 1.0000000
## Corruption 0.4020322  0.29418478 0.21356094 0.24958329 0.5020540
## Generosity 0.1568478 -0.02553066 0.08962885 0.07598731 0.3617513
##            Corruption  Generosity
## Score       0.4020322  0.15684780
## GDP         0.2941848 -0.02553066
## Family      0.2135609  0.08962885
## LifeExp     0.2495833  0.07598731
## Freedom     0.5020540  0.36175133
## Corruption  1.0000000  0.30592986
## Generosity  0.3059299  1.00000000

First, we would like to see the summary statistics for each variable and plot the pairwise relationships to make initial observations. We notice that the ranges of variables including generosity, corruption, and freedom, are not as large as the ranges of variables like GDP, family, and life expectancy.

From the plots of the variables as well as the correlation table, we can see some relationships with linear trend and strong correlation, so we must investigate further.

We can visualize the breakdown of the Happiness Scores for each Region of the world, and compare the means of each Region to the countries within it, as well as with other Regions.

It appears as if the Australia and New Zealand Region has the highest mean Happiness Score of 7.32, and the Sub-Subharan Africa Region has the lowest mean Score of 4.14. We can see a large range of scores within each region such as the Latin America and Caribbean and Middle East and Northern Africa Regions due to a large number of countries in that region, and smaller ranges for the Australia and New Zealand and North America Regions due to only a few countries being in that region.

We want to break down these statistics by Region even further, so we can look at how each factor such as GDP, freedom, life expectancy, family, corruption, generosity and dystopian residual can be influenced by the Region and see if that can show how the overall happiness score is affected.

Visualizations of Summaries of Factors

All the plots below show how much each variable contributes to the overall Happiness Score.

ovr = data %>% group_by(Region) %>% summarise(GDP = mean(GDP))

ggplot(data, aes(Region, GDP, color = Region)) + geom_point()+ geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(GDP, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much GDP Contributes to Happiness")

ovr = data %>% group_by(Region) %>% summarise(Family = mean(Family))

ggplot(data, aes(Region, Family, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(Family, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Family Contributes to Happiness")

ovr = data %>% group_by(Region) %>% summarise(LifeExp = mean(LifeExp))

ggplot(data, aes(Region, LifeExp, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(LifeExp, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Life Expectency Contributes to Happiness")

ovr = data %>% group_by(Region) %>% summarise(Freedom = mean(Freedom))

ggplot(data, aes(Region, Freedom, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(Freedom, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Freedom Contributes to Happiness")

ovr = data %>% group_by(Region) %>% summarise(Corruption = mean(Corruption))

ggplot(data, aes(Region, Corruption, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(Corruption, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Government Corruption Contributes to Happiness")

ovr = data %>% group_by(Region) %>% summarise(Generosity = mean(Generosity))

ggplot(data, aes(Region, Generosity, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(Generosity, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much Generosity Contributes to Happiness")

ovr = data %>% group_by(Region) %>% summarise(DystopianResidual = mean(DystopianResidual))

ggplot(data, aes(Region, DystopianResidual, color = Region)) + geom_point() + geom_point(data = ovr, size = 4, alpha = .8)  + geom_text(data = ovr, aes(label = round(DystopianResidual, 2)), hjust = 0, nudge_x = .1, vjust = 0, nudge_y = 0.1) + theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank()) + labs(title = "How Much the Dystopian Residual Contributes to Happiness")

We observe that the GDP, family, life expectancy, and freedom variables show a clear difference between the means and ranges of factors within the regions. Other variables like corruption, generosity, and dystopian residual do not differ as much region to region. Now that we have seen how each variable contributes to the happiness score in each region, we can look at the relationship between a country’s score and how much each factor contributes to the score. We can also see if a variable appears to have a strong linear trend with respect to a country’s score.

Modeling Happiness Score

We want to create a model to predict the happiness score from the factors: GDP, freedom, life expectancy, family, corruption, and generosity. As mentioned earlier, dystopian residual is excluded as it is used more as a comparison statistic rather than using it to predict a happiness score.

Using forward selection and the F test statistic, we add the variable with the largest F value (smallest p-value) to the model in order to minimize the AIC value. The process of adding variables to a model one-by-one is shown below:

score_fit = lm(Score ~ 1, data = data)
indep.vars = ~ GDP + Family + LifeExp + Freedom + Corruption + Generosity
add1(score_fit, indep.vars, test = "F")
## Single term additions
## 
## Model:
## Score ~ 1
##            Df Sum of Sq     RSS      AIC  F value    Pr(>F)    
## <none>                  203.333   42.600                       
## GDP         1   127.004  76.330 -109.226 257.9027 < 2.2e-16 ***
## Family      1   111.120  92.213  -79.547 186.7808 < 2.2e-16 ***
## LifeExp     1   119.115  84.218  -93.786 219.2273 < 2.2e-16 ***
## Freedom     1    65.329 138.004  -16.247  73.3752 1.005e-14 ***
## Corruption  1    32.865 170.469   16.922  29.8826 1.798e-07 ***
## Generosity  1     5.002 198.331   40.690   3.9094   0.04979 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + GDP)
add1(score_fit, indep.vars, test = "F")
## Single term additions
## 
## Model:
## Score ~ GDP
##            Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                  76.330 -109.23                      
## Family      1   16.2683 60.061 -144.86  41.713 1.312e-09 ***
## LifeExp     1    7.3238 69.006 -123.06  16.344 8.320e-05 ***
## Freedom     1   18.4162 57.913 -150.58  48.971 7.490e-11 ***
## Corruption  1    6.3977 69.932 -120.97  14.089 0.0002466 ***
## Generosity  1    6.3762 69.953 -120.92  14.037 0.0002529 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + Freedom)
add1(score_fit, indep.vars, test = "F")
## Single term additions
## 
## Model:
## Score ~ GDP + Freedom
##            Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                  57.913 -150.58                      
## Family      1    8.2876 49.626 -172.82 25.5514 1.215e-06 ***
## LifeExp     1    5.7291 52.184 -164.93 16.7973 6.727e-05 ***
## Corruption  1    0.4853 57.428 -149.90  1.2929    0.2573    
## Generosity  1    0.7921 57.121 -150.74  2.1216    0.1473    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + Family)
add1(score_fit, indep.vars, test = "F")
## Single term additions
## 
## Model:
## Score ~ GDP + Freedom + Family
##            Df Sum of Sq    RSS     AIC F value    Pr(>F)    
## <none>                  49.626 -172.82                      
## LifeExp     1    5.0887 44.537 -187.81 17.3672 5.153e-05 ***
## Corruption  1    1.1562 48.470 -174.52  3.6257   0.05878 .  
## Generosity  1    0.6567 48.969 -172.91  2.0383   0.15543    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + LifeExp)
add1(score_fit, indep.vars, test = "F")
## Single term additions
## 
## Model:
## Score ~ GDP + Freedom + Family + LifeExp
##            Df Sum of Sq    RSS     AIC F value  Pr(>F)  
## <none>                  44.537 -187.81                  
## Corruption  1   1.27522 43.262 -190.37  4.4510 0.03653 *
## Generosity  1   0.20506 44.332 -186.53  0.6985 0.40462  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
score_fit = update(score_fit, . ~ . + Corruption)
add1(score_fit, indep.vars, test = "F")
## Single term additions
## 
## Model:
## Score ~ GDP + Freedom + Family + LifeExp + Corruption
##            Df Sum of Sq    RSS     AIC F value Pr(>F)
## <none>                  43.262 -190.37               
## Generosity  1  0.055882 43.206 -188.57   0.194 0.6602
summary(score_fit)
## 
## Call:
## lm(formula = Score ~ GDP + Freedom + Family + LifeExp + Corruption, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48327 -0.28172 -0.02771  0.32803  1.46147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2119     0.1502  14.731  < 2e-16 ***
## GDP           0.6971     0.2094   3.329   0.0011 ** 
## Freedom       1.5588     0.3733   4.175 5.01e-05 ***
## Family        1.2344     0.2289   5.393 2.62e-07 ***
## LifeExp       1.4623     0.3430   4.263 3.53e-05 ***
## Corruption    0.9590     0.4546   2.110   0.0365 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5353 on 151 degrees of freedom
## Multiple R-squared:  0.7872, Adjusted R-squared:  0.7802 
## F-statistic: 111.7 on 5 and 151 DF,  p-value: < 2.2e-16

As shown in the summary, the final model is Score ~ GDP + Freedom + Family + LifeExp + Corruption, with a p-value of < 2.2e-16, which is statistically significant.

In order to see if there are other possible models, we can use the step function, which adds or removes the independent variables GDP, Family, LifeExp, Freedom, Corruption, and Generosity to the model to minimize AIC.

score_aic = lm(Score ~ 1, data = data)
indep.vars = ~ GDP + Family + LifeExp + Freedom + Corruption + Generosity 
score_out = step(score_aic, scope = indep.vars, direction = "both")
## Start:  AIC=42.6
## Score ~ 1
## 
##              Df Sum of Sq     RSS      AIC
## + GDP         1   127.004  76.330 -109.226
## + LifeExp     1   119.115  84.218  -93.786
## + Family      1   111.120  92.213  -79.547
## + Freedom     1    65.329 138.004  -16.247
## + Corruption  1    32.865 170.469   16.922
## + Generosity  1     5.002 198.331   40.690
## <none>                    203.333   42.600
## 
## Step:  AIC=-109.23
## Score ~ GDP
## 
##              Df Sum of Sq     RSS     AIC
## + Freedom     1    18.416  57.913 -150.58
## + Family      1    16.268  60.061 -144.86
## + LifeExp     1     7.324  69.006 -123.06
## + Corruption  1     6.398  69.932 -120.97
## + Generosity  1     6.376  69.953 -120.92
## <none>                     76.330 -109.23
## - GDP         1   127.004 203.333   42.60
## 
## Step:  AIC=-150.58
## Score ~ GDP + Freedom
## 
##              Df Sum of Sq     RSS      AIC
## + Family      1     8.288  49.626 -172.823
## + LifeExp     1     5.729  52.184 -164.930
## + Generosity  1     0.792  57.121 -150.738
## <none>                     57.913 -150.576
## + Corruption  1     0.485  57.428 -149.897
## - Freedom     1    18.416  76.330 -109.226
## - GDP         1    80.090 138.004  -16.247
## 
## Step:  AIC=-172.82
## Score ~ GDP + Freedom + Family
## 
##              Df Sum of Sq    RSS     AIC
## + LifeExp     1    5.0887 44.537 -187.81
## + Corruption  1    1.1562 48.470 -174.52
## + Generosity  1    0.6567 48.969 -172.91
## <none>                    49.626 -172.82
## - Family      1    8.2876 57.913 -150.58
## - Freedom     1   10.4355 60.061 -144.86
## - GDP         1   28.6222 78.248 -103.33
## 
## Step:  AIC=-187.81
## Score ~ GDP + Freedom + Family + LifeExp
## 
##              Df Sum of Sq    RSS     AIC
## + Corruption  1    1.2752 43.262 -190.37
## <none>                    44.537 -187.81
## + Generosity  1    0.2051 44.332 -186.53
## - GDP         1    3.8695 48.407 -176.73
## - LifeExp     1    5.0887 49.626 -172.82
## - Family      1    7.6472 52.184 -164.93
## - Freedom     1    9.5958 54.133 -159.17
## 
## Step:  AIC=-190.37
## Score ~ GDP + Freedom + Family + LifeExp + Corruption
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    43.262 -190.37
## + Generosity  1    0.0559 43.206 -188.57
## - Corruption  1    1.2752 44.537 -187.81
## - GDP         1    3.1742 46.436 -181.25
## - Freedom     1    4.9945 48.256 -175.22
## - LifeExp     1    5.2078 48.470 -174.52
## - Family      1    8.3320 51.594 -164.72
summary(score_out)
## 
## Call:
## lm(formula = Score ~ GDP + Freedom + Family + LifeExp + Corruption, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48327 -0.28172 -0.02771  0.32803  1.46147 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2119     0.1502  14.731  < 2e-16 ***
## GDP           0.6971     0.2094   3.329   0.0011 ** 
## Freedom       1.5588     0.3733   4.175 5.01e-05 ***
## Family        1.2344     0.2289   5.393 2.62e-07 ***
## LifeExp       1.4623     0.3430   4.263 3.53e-05 ***
## Corruption    0.9590     0.4546   2.110   0.0365 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5353 on 151 degrees of freedom
## Multiple R-squared:  0.7872, Adjusted R-squared:  0.7802 
## F-statistic: 111.7 on 5 and 151 DF,  p-value: < 2.2e-16

As we can see from the output of the stepwise function, we get the same model as before: Score ~ GDP + Freedom + Family + LifeExp + Corruption. This model uses all of the factors except generosity, which is not statistically significant to include. We can see how well this model fits the data by observing the plots below:

par(mfrow = c(2,2))
plot(score_fit)

critval = qt(0.05/(2*nobs(score_fit)), df=df.residual(score_fit)-1, lower=FALSE)
which(abs(rstudent(score_fit)) > critval)
## named integer(0)

The residuals plot shows that points are equally spread and fall on a flat line, so the constant variance assumption appears to be satisfied. The points on the Normal Q-Q plot generally fall on the line, except for at the extremes. This means the normality assumption also holds. There are no outliers or influential points, as shown in the plots and the Bonferroni adjusted critical value.

plot(data$Score, score_fit$fitted.values)
abline(0, 1, col = "red")

This plot confirms that the model’s predicted values closely match the observed happiness scores when using the six predictor variables to predict scores. We conclude that the final model Score ~ GDP + Freedom + Family + LifeExp + Corruption fits our data well.